Mass accumulation rate changes in Chinese loess during MIS 2, and asynchrony with records from Greenland ice cores and North Pacific Ocean sediments during the Last Glacial Maximum
Sensitivity-corrected quartz optically stimulated luminescence (OSL) dating methods have been widely accepted as a promising tool for constructing late Pleistocene chronologies and mass or dust accumulation rates (MARs or DARs) on the Chinese Loess Plateau (CLP). Many quartz OSL ages covering marine isotope stage (MIS) 2 (equivalent to L1-1 in Chinese loess) have been determined for individual sites within the CLP in the past decade. However, there is still a lack of detailed MAR or DAR reconstructions for MIS 2 across the whole of the CLP. Here, we present detailed MARs determined for eight sites with closely spaced quartz OSL ages covering MIS 2, and relative MARs suggested by a probability density analysis of 159 quartz OSL ages ranging from 30 to 10 ka ago, from 15 sites on the CLP. The results show enhanced dust accumulation during the Last Glacial Maximum (LGM), with particularly rapid dust accumulation from 23 to 19 ka ago (the late LGM). In contrast, MARs determined for the last deglaciation (from 19 to 12 ka ago) are low. The MAR changes during MIS 2 in Chinese loess are mainly controlled by the intensity of the East Asian winter monsoon (EAWM), which is forced by Northern Hemisphere ice volume. The MAR changes also indicate that dust accumulation during MIS 2 was generally continuous at millennial time scales on the CLP. Comparison of Asian-sourced aeolian dust MARs in Chinese loess with those preserved in Greenland ice cores and North Pacific Ocean sediments indicates that rapid dust accumulation occurred from 26 to 23 ka ago (the early LGM) in Greenland ice cores and North Pacific Ocean sediments, suggesting a difference of several thousand years in timing compared with the rapid dust accumulation during the late LGM in Chinese loess. This asynchronous timing of enhanced dust accumulation is probably related to changes in both the EAWM intensity and the mean position of the westerly jet axis, both of which are greatly influenced by Northern Hemisphere ice volume. This study highlights the possible influence of changes in the mean position of the westerly jet axis on the long-range transport of Asian-sourced dust.
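Not from the paper, but a minimal sketch of the probability density analysis mentioned above: each OSL age is represented as a Gaussian kernel with its 1-sigma uncertainty and the kernels are summed, so that density peaks indicate intervals in which dated samples cluster (taken as relatively rapid accumulation). The ages and errors below are hypothetical.

```python
import numpy as np

ages_ka = np.array([25.1, 22.4, 21.0, 19.8, 14.2])   # hypothetical quartz OSL ages (ka)
errors_ka = np.array([1.2, 1.0, 0.9, 0.8, 0.7])      # hypothetical 1-sigma errors (ka)

t = np.linspace(10, 30, 2001)                        # MIS 2 window, 10-30 ka
density = np.zeros_like(t)
for age, err in zip(ages_ka, errors_ka):
    density += np.exp(-0.5 * ((t - age) / err) ** 2) / (err * np.sqrt(2 * np.pi))
density /= len(ages_ka)                              # normalised summed probability

print("Mean density, late LGM (23-19 ka):", density[(t >= 19) & (t <= 23)].mean())
print("Mean density, deglaciation (19-12 ka):", density[(t >= 12) & (t < 19)].mean())
```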
EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis
There has been significant progress in emotional Text-To-Speech (TTS)
synthesis technology in recent years. However, existing methods primarily focus
on the synthesis of a limited number of emotion types and have achieved
unsatisfactory performance in intensity control. To address these limitations,
we propose EmoMix, which can generate emotional speech with specified intensity
or a mixture of emotions. Specifically, EmoMix is a controllable emotional TTS
model based on a diffusion probabilistic model and a pre-trained speech emotion
recognition (SER) model used to extract emotion embeddings. Mixed-emotion synthesis is achieved by combining the noise predicted by the diffusion model conditioned on different emotions within a single sampling process at run time. We further mix the Neutral emotion with a specific primary emotion in varying proportions to control intensity. Experimental results validate the effectiveness of EmoMix for mixed-emotion synthesis and intensity control.
Comment: Accepted by the 24th Annual Conference of the International Speech Communication Association (INTERSPEECH 2023)
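A minimal sketch, not the released EmoMix implementation, of the noise-combination idea described above: at each reverse-diffusion step the denoiser's outputs conditioned on two emotion embeddings are blended with a mixing weight. The names `denoiser`, `emb_a`, and `emb_b` are hypothetical stand-ins for the diffusion decoder and the SER-derived embeddings.

```python
def mixed_denoise_step(denoiser, x_t, t, emb_a, emb_b, gamma=0.5):
    """Blend two emotion-conditioned noise estimates in one sampling step.

    Setting emb_b to a Neutral embedding and varying gamma gives a rough
    handle on emotion intensity, mirroring the intensity control described above.
    """
    eps_a = denoiser(x_t, t, emb_a)   # noise predicted under emotion A
    eps_b = denoiser(x_t, t, emb_b)   # noise predicted under emotion B
    return gamma * eps_a + (1.0 - gamma) * eps_b
```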
Improving Music Genre Classification from multi-modal properties of music and genre correlations Perspective
Music genre classification has been widely studied in the past few years for its
various applications in music information retrieval. Previous works tend to
perform unsatisfactorily, since those methods only use audio content or jointly
use audio content and lyrics content inefficiently. In addition, as genres
normally co-occur in a music track, it is desirable to capture and model the
genre correlations to improve the performance of multi-label music genre
classification. To solve these issues, we present a novel multi-modal method
leveraging an audio-lyrics contrastive loss and two symmetric cross-modal attention modules to align and fuse features from audio and lyrics. Furthermore, based
on the nature of the multi-label classification, a genre correlations
extraction module is presented to capture and model potential genre
correlations. Extensive experiments demonstrate that our proposed method
significantly surpasses other multi-label music genre classification methods
and achieves state-of-the-art results on the Music4All dataset.
Comment: Accepted by ICASSP 202
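A minimal sketch of one way to realise the audio-lyrics contrastive alignment mentioned above (an assumed form, not the paper's code): a symmetric InfoNCE-style loss over paired audio and lyrics embeddings, with a hypothetical temperature value.

```python
import torch
import torch.nn.functional as F

def audio_lyrics_contrastive_loss(audio_emb, lyrics_emb, temperature=0.07):
    # audio_emb, lyrics_emb: (batch, dim); row i of each comes from the same track
    a = F.normalize(audio_emb, dim=-1)
    ly = F.normalize(lyrics_emb, dim=-1)
    logits = a @ ly.t() / temperature                     # pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)    # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```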
CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation
Better disentanglement of speech representation is essential to improve the
quality of voice conversion. Recently, contrastive learning has been successfully applied to voice conversion based on speaker labels. However, model performance degrades when converting between similar speakers. Hence, we propose augmented negative sample selection to address this issue. Specifically, we create hard negative samples with the proposed speaker fusion module to improve the learning ability of the speaker encoder. Furthermore, considering fine-grained modeling of speaker style, we employ a reference encoder to extract fine-grained style and conduct augmented contrastive learning on the global style. The experimental results show that the proposed method outperforms previous work on voice conversion tasks.
Comment: Accepted by the 21st IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA 2023)
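A minimal sketch of the hard-negative idea described above (assumptions, not the paper's code): each anchor speaker embedding is fused with another speaker's embedding to produce a negative that is deliberately close to the anchor, and a contrastive objective (a triplet loss here, purely for illustration) pushes it away. The fusion weight and toy embeddings are hypothetical.

```python
import torch
import torch.nn.functional as F

def speaker_fusion_negatives(spk_emb, alpha=0.7):
    """Fuse each speaker embedding with a different speaker from the batch
    to create a 'hard' negative close to, but distinct from, the anchor."""
    other = torch.roll(spk_emb, shifts=1, dims=0)
    return F.normalize(alpha * spk_emb + (1.0 - alpha) * other, dim=-1)

anchor = F.normalize(torch.randn(16, 256), dim=-1)                    # toy speaker embeddings
positive = F.normalize(anchor + 0.05 * torch.randn(16, 256), dim=-1)  # same speaker, perturbed
negative = speaker_fusion_negatives(anchor)

# Pull same-speaker pairs together, push fused hard negatives away.
loss = F.triplet_margin_loss(anchor, positive, negative, margin=0.3)
```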
DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks
Generating realistic talking faces is a complex and widely discussed task
with numerous applications. In this paper, we present DiffTalker, a novel model
designed to generate lifelike talking faces through audio and landmark
co-driving. DiffTalker addresses the challenges of applying diffusion models, which are traditionally trained on text-image pairs, directly to audio control. DiffTalker consists of two agent networks: a
transformer-based landmarks completion network for geometric accuracy and a
diffusion-based face generation network for texture details. Landmarks play a
pivotal role in establishing a seamless connection between the audio and image
domains, facilitating the incorporation of knowledge from pre-trained diffusion
models. This approach efficiently produces faces that speak articulately. Experimental results showcase DiffTalker's superior performance in producing clear and geometrically accurate talking faces, all without the need for additional alignment between audio and image features.
Comment: Submitted to ICASSP 202
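A minimal pipeline sketch of the two-stage design described above, with assumed interfaces rather than DiffTalker's actual code: audio features first drive a landmark completion network, and the completed landmarks then condition a diffusion-based face generator during reverse sampling. `completer` and `face_diffuser` are hypothetical stand-ins for the two agent networks.

```python
import torch

def generate_talking_frame(audio_feat, partial_landmarks, completer, face_diffuser,
                           steps=50):
    """Geometry first (landmarks from audio), then texture (landmark-conditioned diffusion)."""
    landmarks = completer(audio_feat, partial_landmarks)     # transformer-based geometry branch
    frame = torch.randn(1, 3, 256, 256)                      # start the reverse process from noise
    for t in reversed(range(steps)):
        frame = face_diffuser.denoise(frame, t, landmarks)   # texture branch, conditioned on landmarks
    return frame
```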
QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis
Recent expressive text-to-speech (TTS) models focus on synthesizing emotional
speech, but some fine-grained styles such as intonation are neglected. In this
paper, we propose QI-TTS which aims to better transfer and control intonation
to further deliver the speaker's questioning intention while transferring
emotion from reference speech. We propose a multi-style extractor to extract style embeddings at two different levels: the sentence level represents emotion, while the final-syllable level represents intonation. For fine-grained intonation control, we use relative attributes to represent intonation intensity at the syllable level. Experiments have validated the effectiveness of QI-TTS for improving intonation expressiveness in emotional speech synthesis.
Comment: Accepted by ICASSP 202
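A minimal sketch of the relative-attribute idea mentioned above (an illustrative setup, not the paper's code): a linear ranking function is trained with a margin ranking loss so that syllable-level embeddings with stronger questioning intonation score higher than weaker ones. The dimensions and toy data are hypothetical.

```python
import torch
import torch.nn as nn

ranker = nn.Linear(128, 1, bias=False)        # w^T x scores intonation intensity
optimizer = torch.optim.Adam(ranker.parameters(), lr=1e-3)
rank_loss = nn.MarginRankingLoss(margin=1.0)

stronger = torch.randn(32, 128)               # toy syllable embeddings, higher intensity
weaker = torch.randn(32, 128)                 # toy syllable embeddings, lower intensity
target = torch.ones(32)                       # "stronger should outrank weaker"

loss = rank_loss(ranker(stronger).squeeze(-1), ranker(weaker).squeeze(-1), target)
loss.backward()
optimizer.step()
```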
FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework
This paper integrates graph-to-sequence into an end-to-end text-to-speech
framework for syntax-aware modelling with syntactic information of input text.
Specifically, the input text is parsed by a dependency parsing module to form a
syntactic graph. The syntactic graph is then encoded by a graph encoder to
extract the syntactic hidden information, which is concatenated with phoneme
embedding and input to the alignment and flow-based decoding modules to
generate the raw audio waveform. The model is evaluated on two languages, English and Mandarin, using single-speaker, few-shot target-speaker, and multi-speaker datasets. Experimental results show better prosodic consistency between the input text and the generated audio, higher scores in the subjective prosodic evaluation, and the ability to perform voice conversion. In addition, the efficiency of the model is largely boosted through the design of an AI chip operator, yielding 5x acceleration.
Comment: Accepted by the 35th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2023)
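A minimal sketch of the syntactic-graph encoding step described above (an illustration under assumptions, not the paper's implementation): dependency head indices are turned into an adjacency matrix and passed through one simple graph-convolution step, whose output would then be concatenated with phoneme embeddings. The parse and feature sizes are toy values.

```python
import torch

# Hypothetical dependency parse of "the cat sleeps": head index per word, -1 = root
heads = [1, 2, -1]
n = len(heads)
adj = torch.eye(n)                                   # self-loops
for child, head in enumerate(heads):
    if head >= 0:
        adj[child, head] = adj[head, child] = 1.0    # undirected dependency edge

word_emb = torch.randn(n, 64)                        # toy word-level features
weight = torch.randn(64, 64)
deg = adj.sum(-1, keepdim=True)
syntax_hidden = torch.relu((adj / deg) @ word_emb @ weight)   # one GCN-style propagation step

phoneme_emb = torch.randn(n, 192)                    # toy phoneme features (word-aligned here)
decoder_input = torch.cat([phoneme_emb, syntax_hidden], dim=-1)  # fed to alignment/decoding
```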
PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion
Voice conversion, the style transfer task applied to speech, refers to converting one person's speech into new speech that sounds like another person's. To date, much research has been devoted to better implementations of VC tasks. However, a good voice conversion model should not
only match the timbre information of the target speaker, but also expressive
information such as prosody, pace, pause, etc. In this context, prosody
modeling is crucial for achieving expressive voice conversion that sounds
natural and convincing. Unfortunately, prosody modeling is important but
challenging, especially without text transcriptions. In this paper, we propose a novel voice conversion framework named 'PMVC', which effectively
separates and models the content, timbre, and prosodic information from the
speech without text transcriptions. Specifically, we introduce a new speech augmentation algorithm for robust prosody extraction. Building upon this, a mask-and-predict mechanism is applied to disentangle prosody and content information. Experimental results on the AIShell-3 corpus support the improvement in naturalness and similarity of the converted speech.
Comment: Accepted by the 31st ACM International Conference on Multimedia (MM2023)
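A minimal sketch of the mask-and-predict idea described above (an assumed form, not PMVC's code): random frames of the prosody representation are masked and a predictor must reconstruct them from the content features plus unmasked context, which discourages content information from leaking into the prosody code. `predictor` is a hypothetical network mapping the concatenated features back to the prosody dimension.

```python
import torch
import torch.nn.functional as F

def mask_and_predict_loss(prosody, content, predictor, mask_ratio=0.3):
    # prosody, content: (batch, frames, dim); predictor maps 2*dim -> dim
    mask = torch.rand(prosody.shape[:2], device=prosody.device) < mask_ratio
    masked = prosody.masked_fill(mask.unsqueeze(-1), 0.0)       # hide selected frames
    pred = predictor(torch.cat([masked, content], dim=-1))      # reconstruct from context + content
    return F.l1_loss(pred[mask], prosody[mask])                 # loss only on masked frames
```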